Learning Mask-aware CLIP Representations for Zero-Shot Segmentation (Supplementary Material)

Neural Information Processing Systems

In the supplementary material, we first introduce the technical details of the "frozen CLIP" approaches in Sec. 1, and then describe the dataset settings in Sec. 2. Figure 1 presents an overview of the "frozen CLIP" approach; note that all sub-images are resized before being passed to CLIP. Figure 2 compares three merge operations. We evaluate the performance of MAFT on Pascal-VOC, COCO-Stuff, and ADE20K. Pascal-VOC provides 10,582 images for training and 1,449 images for testing. ADE20K contains 25k images for training and 2k images for validation. Pascal-Context is an extension of Pascal-VOC 2010.


Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Neural Information Processing Systems

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This issue mainly relates to the fact that CLIP is trained with image-level supervision.
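The proposal-classification paradigm described in this abstract can be illustrated with a minimal numpy sketch: each mask proposal yields a region embedding, which is matched against class text embeddings by cosine similarity. The embeddings below are random stand-ins, not actual CLIP features:

```python
import numpy as np

def classify_proposals(region_embeds, text_embeds):
    """CLIP-style classification: cosine similarity between each
    mask-proposal embedding and each class text embedding."""
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = r @ t.T                  # (num_proposals, num_classes)
    return logits.argmax(axis=1)      # predicted class per proposal

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 512))   # stand-ins for CLIP image embeddings
texts = rng.normal(size=(3, 512))     # stand-ins for CLIP text embeddings
print(classify_proposals(regions, texts))
```

The mask-insensitivity issue the paper identifies would show up here as near-identical region embeddings (and hence identical predictions) for different proposals of the same image.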


Consistent Structural Relation Learning for Zero-Shot Segmentation

Neural Information Processing Systems

Zero-shot semantic segmentation aims to recognize the semantics of pixels from unseen categories with zero training samples. Previous practice [1] proposed to train the classifiers for unseen categories using visual features generated from semantic word embeddings. However, the generator is learned only on the seen categories, while no constraint is applied to the unseen categories, leading to poor generalization ability. In this work, we propose a Consistent Structural Relation Learning (CSRL) approach to constrain the generation of unseen visual features by exploiting the structural relations between seen and unseen categories. We observe that different categories usually exhibit similar relations in either the semantic word-embedding space or the visual feature space.
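The structural-relation idea can be sketched in numpy: build pairwise cosine-similarity ("relation") matrices in both the word-embedding and visual-feature spaces, and penalize their disagreement. The MSE consistency term and all names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def relation_matrix(feats):
    """Pairwise cosine-similarity matrix: the 'structural relations'
    among categories within one feature space."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def relation_consistency_loss(word_embeds, visual_feats):
    """Penalize disagreement between the relation structure of the
    semantic word-embedding space and the generated visual space."""
    r_sem = relation_matrix(word_embeds)
    r_vis = relation_matrix(visual_feats)
    return float(np.mean((r_sem - r_vis) ** 2))

rng = np.random.default_rng(1)
words = rng.normal(size=(4, 300))    # word embeddings for 4 categories
visual = rng.normal(size=(4, 256))   # generated visual features
print(relation_consistency_loss(words, visual))
```

Minimizing such a term ties the generated visual features of unseen categories to the relational structure already observable among word embeddings.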


Review for NeurIPS paper: Consistent Structural Relation Learning for Zero-Shot Segmentation

Neural Information Processing Systems

Summary and Contributions: Post-rebuttal update: I originally gave this paper an '8' and I will keep my original rating. The method is a good improvement upon [1]: it extends [1] with a simple and reproducible idea. Experimentally, the authors demonstrate good improvements over [1]. In contrast to R3, I think this is not only a decent amount of novelty, but also the simple kind of novelty that is likely to be adopted by other researchers. The other two main weaknesses highlighted by several reviewers were: 1) A better positioning w.r.t.


Review for NeurIPS paper: Consistent Structural Relation Learning for Zero-Shot Segmentation

Neural Information Processing Systems

Paper originally received a set of somewhat mixed reviews from four reviewers, with scores: 8, 5, 5, 6. Generally, the reviewers liked the work, commenting on how it addressed an important problem [R3] and presented a well-motivated idea [R1] that was novel [R2], simple, and reproducible [R1], ultimately resulting in good results [R1,R2,R3,R4]. Some shortcomings were also identified, including (1) unclear positioning and potentially limited novelty with respect to [1] [R1,R2,R3] and (2) a lack of sufficient comparisons to related work [R2,R3,R4]. The authors provided a very thorough rebuttal that addressed all major concerns, offering a compelling clarification of novelty (1) and additional experiments to address the reviewers' comments for (2). As a result, R2 and R3 raised their scores, arriving at final unanimously positive ratings for the paper of: 8, 7, 6, 6. The AC has read the reviews, the rebuttal, the resulting discussion, and the paper itself.



MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation

Koleilat, Taha, Asgariandehkordi, Hojat, Rivaz, Hassan, Xiao, Yiming

arXiv.org Artificial Intelligence

Segmentation of anatomical structures and pathological regions in medical images is essential for modern clinical diagnosis, disease research, and treatment planning. While significant advancements have been made in deep learning-based segmentation techniques, many of these methods still suffer from limitations in data efficiency, generalizability, and interactivity. As a result, developing precise segmentation methods that require fewer labeled datasets remains a critical challenge in medical image analysis. Recently, the introduction of foundation models like CLIP and Segment-Anything-Model (SAM), with robust cross-domain representations, has paved the way for interactive and universal image segmentation. However, further exploration of these models for data-efficient segmentation in medical imaging is still needed and highly relevant. In this paper, we introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans using text prompts, in both zero-shot and weakly supervised settings. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the Multi-modal Information Bottleneck (M2IB) to create visual prompts for generating segmentation masks from SAM in the zero-shot setting. We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further. Extensive testing across four diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high accuracy of our proposed framework. Our code is available at https://github.com/HealthX-Lab/MedCLIP-SAMv2.
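The hard-negative-aware contrastive fine-tuning can be illustrated generically: an InfoNCE-style image-text loss in which more-similar (harder) negatives receive larger weights. This is a hedged sketch of the general idea only, not the exact DHN-NCE formulation used by the authors:

```python
import numpy as np

def contrastive_hard_neg_loss(img, txt, tau=0.07, beta=0.25):
    """InfoNCE-style image-text contrastive loss where harder negatives
    (higher similarity) are up-weighted by exp(beta * sim).
    A generic sketch, not the paper's DHN-NCE loss."""
    i = img / np.linalg.norm(img, axis=1, keepdims=True)
    t = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = (i @ t.T) / tau                    # (N, N); diagonal = positives
    n = sim.shape[0]
    pos = np.diag(sim)
    neg_mask = ~np.eye(n, dtype=bool)
    weights = np.exp(beta * sim) * neg_mask  # harder negatives weigh more
    denom = np.exp(pos) + (weights * np.exp(sim)).sum(axis=1)
    return float(np.mean(np.log(denom) - pos))

rng = np.random.default_rng(2)
image_embeds = rng.normal(size=(8, 512))   # stand-in image embeddings
text_embeds = rng.normal(size=(8, 512))    # stand-in paired text embeddings
print(contrastive_hard_neg_loss(image_embeds, text_embeds))
```

Since the denominator always includes the positive term, the loss is non-negative; the `beta` weighting is the part that emphasizes hard negatives during fine-tuning.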



MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Koleilat, Taha, Asgariandehkordi, Hojat, Rivaz, Hassan, Xiao, Yiming

arXiv.org Artificial Intelligence

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactivity. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models, such as CLIP and the Segment-Anything-Model (SAM), with comprehensive cross-domain representations opened the door for interactive and universal image segmentation. However, the exploration of these models for data-efficient medical image segmentation is still limited, yet highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM, that combines the CLIP and SAM models to generate segmentations of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employ a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model, and the recent gScoreCAM to generate prompts for obtaining segmentation masks from SAM in a zero-shot setting. Additionally, we explore the use of zero-shot segmentation labels in a weakly supervised paradigm to improve segmentation quality further. Through extensive testing on three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework demonstrates excellent accuracy. Code is available at https://github.com/HealthX-Lab/MedCLIP-SAM.


Increasing SAM Zero-Shot Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation

Jiang, Zekun, Cheng, Dongjie, Qin, Ziyuan, Gao, Jun, Lao, Qicheng, Li, Kang, Zhang, Le

arXiv.org Artificial Intelligence

This study develops and evaluates a novel multimodal medical image zero-shot segmentation algorithm, named Text-Visual-Prompt SAM (TV-SAM), that requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the Segment Anything Model (SAM) to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM for zero-shot segmentation. Comprehensive evaluations are conducted on seven public datasets encompassing eight imaging modalities, demonstrating that TV-SAM can effectively segment unseen targets across various modalities without additional training, significantly outperforming SAM AUTO and GSAM, closely matching the performance of SAM BBOX with gold-standard bounding-box prompts, and surpassing the state of the art on specific datasets such as ISIC and WBC. The study indicates that TV-SAM serves as an effective multimodal medical image zero-shot segmentation algorithm, highlighting the significant contribution of GPT-4 to zero-shot segmentation. By integrating foundation models such as GPT-4, GLIP, and SAM, it could enhance the capability to address complex problems in specialized domains. The code is available at: https://github.com/JZK00/TV-SAM.
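The three-stage pipeline this abstract describes (text prompt → bounding box → mask) can be sketched with toy stand-ins; each function below merely mimics the role of the real GPT-4, GLIP, and SAM components, and all names and return values are hypothetical:

```python
import numpy as np

def generate_text_prompt(image):
    # GPT-4 stage (stand-in): describe the target in the medical image
    return "a round hypointense lesion"

def text_to_box(image, prompt):
    # GLIP stage (stand-in): ground the description to a box (x0, y0, x1, y1)
    h, w = image.shape
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def box_to_mask(image, box):
    # SAM stage (stand-in): turn the box prompt into a binary mask
    mask = np.zeros_like(image, dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True
    return mask

image = np.zeros((64, 64))           # toy single-channel scan
prompt = generate_text_prompt(image)
mask = box_to_mask(image, text_to_box(image, prompt))
print(mask.sum())                    # number of segmented pixels
```

The key design point is that every stage is promptable: swapping the toy functions for the real models yields a segmentation pipeline that needs no human annotation at inference time.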